IN or NOT IN conditions are used in FILTER/WHERE or even in JOINS when we have to specify multiple possible values for any column. If the value is one of the values mentioned inside “IN” clause then it will qualify. It is opposite for “NOT IN” where the value must not be among any one present inside NOT IN clause.
So let’s look at the example for IN condition
Create DataFrame
val empDF = spark.createDataFrame(Seq(
(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
(7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
)).toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")
empDF.show
Filter with department Number
empDF.filter("deptno IN (20,30)").show
empDF.where("job IN ('SALESMAN','MANAGER')").show
Note: “in” method is not available in Spark 2.0. So prefer method is “isin”
Filter with department Number
empDF.filter($"deptno".isin (20,30) ).show
Filter with job description
empDF.filter($"job".isin ("ADAMS","CLERK")).show
Filter with NOT IN
To use the condition as “NOT IN”, you can use negation (!) before the column name in the previous isin query.
empDF.filter("deptno not in (20,30)").show
No comments:
Post a Comment